ddl: fail stale backfill task meta #68842

Open
zeminzhou wants to merge 3 commits into pingcap:master from zeminzhou:codex/fix-stale-backfill-task-meta

@zeminzhou
Contributor

@zeminzhou zeminzhou commented Jun 1, 2026

What problem does this PR solve?

Issue Number: close #68828

Problem Summary:

Distributed add-index backfill can pick up an existing DXF task by task key and resume it without checking whether the persisted BackfillTaskMeta still matches the current DDL job and reorg elements. If the old task meta contains stale EleIDs, the executor repeatedly returns "index info not found". Because that error is treated as retryable, the task can retry forever without making progress.

What changed and how does it work?

  • Add a dedicated "backfill task meta is outdated" error for stale DXF backfill task metadata.
  • Validate an existing backfill task's persisted metadata before resuming it. The validation checks the job ID, schema ID, table ID, element type, and element IDs against the current reorgInfo.
  • If validation fails for a non-terminal task, mark the DXF task as failed and notify the scheduler instead of resuming the stale task.
  • Treat the stale backfill task metadata error as non-retryable in both DDL reorg retry classification and the backfill DXF executor.
  • Convert "index info not found" during distributed backfill executor setup into the same non-retryable stale-meta error.
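The validation step described in the list above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `reorgIdentity`, `validateTaskMeta`, `sameIDs`, and `isOutdated` are hypothetical names, the struct is a simplified stand-in for TiDB's BackfillTaskMeta and reorgInfo, and comparing element IDs in order is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// errTaskMetaOutdated stands in for the PR's dedicated stale-meta error.
var errTaskMetaOutdated = errors.New("backfill task meta is outdated")

// reorgIdentity is a hypothetical, simplified view of the fields the PR
// description says are compared. It is not TiDB's BackfillTaskMeta or reorgInfo.
type reorgIdentity struct {
	JobID    int64
	SchemaID int64
	TableID  int64
	ElemType string
	ElemIDs  []int64
}

// validateTaskMeta returns an error wrapping errTaskMetaOutdated when the
// persisted identity differs from the current one in any compared field.
func validateTaskMeta(persisted, current reorgIdentity) error {
	mismatch := func(field string, got, want any) error {
		return fmt.Errorf("%w: %s is %v, current reorg has %v",
			errTaskMetaOutdated, field, got, want)
	}
	switch {
	case persisted.JobID != current.JobID:
		return mismatch("job ID", persisted.JobID, current.JobID)
	case persisted.SchemaID != current.SchemaID:
		return mismatch("schema ID", persisted.SchemaID, current.SchemaID)
	case persisted.TableID != current.TableID:
		return mismatch("table ID", persisted.TableID, current.TableID)
	case persisted.ElemType != current.ElemType:
		return mismatch("element type", persisted.ElemType, current.ElemType)
	case !sameIDs(persisted.ElemIDs, current.ElemIDs):
		return mismatch("element IDs", persisted.ElemIDs, current.ElemIDs)
	}
	return nil
}

// sameIDs compares two ID slices element by element (order-sensitive by assumption).
func sameIDs(a, b []int64) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// isOutdated reports whether err is, or wraps, the stale-meta error.
func isOutdated(err error) bool {
	return errors.Is(err, errTaskMetaOutdated)
}

func main() {
	current := reorgIdentity{JobID: 10, SchemaID: 1, TableID: 2, ElemType: "index", ElemIDs: []int64{7}}
	stale := current
	stale.ElemIDs = []int64{5} // element ID left over from an earlier attempt

	fmt.Println("fresh meta error:", validateTaskMeta(current, current))
	fmt.Println("stale meta error:", validateTaskMeta(stale, current))
}
```

Wrapping the sentinel with `%w` keeps the mismatch details in the message while still letting callers classify the error with a single `errors.Is` check.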

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Test commands:

./tools/check/failpoint-go-test.sh pkg/ddl -run 'TestOutdatedBackfillTaskMetaIsNonRetryable|TestValidateBackfillTaskMeta' -count=1
make bazel_prepare
make lint

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

Fix a bug that could cause distributed `ADD INDEX` backfill to keep retrying forever when it resumed a stale DXF task whose metadata no longer matched the current table indexes.

Summary by CodeRabbit

  • Bug Fixes

    • Improved detection and handling of stale backfill task metadata during distributed schema operations. Such cases are now treated as non-retryable and fail immediately, with error messages including index, table, and job identifiers for clearer diagnostics.
  • Tests

    • Added tests verifying outdated metadata is detected, treated as non-retryable (including when wrapped), and that task metadata validation behaves as expected.

@ti-chi-bot ti-chi-bot Bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/needs-triage-completed labels Jun 1, 2026
@pantheon-ai

pantheon-ai Bot commented Jun 1, 2026

@zeminzhou I've received your pull request and will start the review. I'll conduct a thorough review covering code quality, potential issues, and implementation details.

⏳ This process typically takes 10-30 minutes depending on the complexity of the changes.

ℹ️ Learn more details on Pantheon AI.

@ti-chi-bot ti-chi-bot Bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jun 1, 2026
@ti-chi-bot

ti-chi-bot Bot commented Jun 1, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign yangkeao for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tiprow

tiprow Bot commented Jun 1, 2026

Hi @zeminzhou. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@coderabbitai

coderabbitai Bot commented Jun 1, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: f207a774-2223-4d1f-bc8f-71fccb523a15

📥 Commits

Reviewing files that changed from the base of the PR and between 711597d and 0ec810e.

📒 Files selected for processing (3)
  • pkg/ddl/backfilling_dist_executor.go
  • pkg/ddl/backfilling_test.go
  • pkg/ddl/index.go
💤 Files with no reviewable changes (3)
  • pkg/ddl/backfilling_test.go
  • pkg/ddl/index.go
  • pkg/ddl/backfilling_dist_executor.go

📝 Walkthrough

Adds an unexported sentinel error for outdated backfill task metadata, annotates executor "index info not found" failures with it, makes retryability checks treat that sentinel as non-retryable, and adds tests verifying non-retryability including when wrapped.

Changes

Stale Backfill Task Metadata Handling

| Layer / File(s) | Summary |
|---|---|
| Sentinel error, executor annotation, and executor IsRetryableError: pkg/ddl/backfilling_dist_executor.go | Add errBackfillTaskMetaOutdated and isBackfillTaskMetaOutdatedErr; annotate "index info not found" with the sentinel; short-circuit backfillDistExecutor.IsRetryableError to return false for sentinel errors. |
| Global retryability short-circuit: pkg/ddl/index.go | isRetryableError adds an early check: if isBackfillTaskMetaOutdatedErr(err) is true, return false immediately. |
| Tests and import update: pkg/ddl/backfilling_test.go | Import github.com/pingcap/errors and add TestOutdatedBackfillTaskMetaIsNonRetryable asserting the sentinel-classified error is non-retryable for both helper and executor paths, including when wrapped via errors.Annotatef. |
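The sentinel-plus-short-circuit pattern summarized in the walkthrough can be illustrated with a small standalone sketch. It uses only the Go standard library: `annotate` is a hypothetical stand-in for errors.Annotatef from github.com/pingcap/errors, the names `errMetaOutdated`, `isMetaOutdated`, `isRetryable`, and `errTransient` are invented for illustration, and treating every other error as retryable is a simplifying assumption rather than TiDB's actual classification.

```go
package main

import (
	"errors"
	"fmt"
)

// errMetaOutdated plays the role of the unexported sentinel from the walkthrough.
var errMetaOutdated = errors.New("backfill task meta is outdated")

// errTransient is an arbitrary non-sentinel error used for contrast.
var errTransient = errors.New("transient network error")

// isMetaOutdated reports whether err is, or wraps, the sentinel.
func isMetaOutdated(err error) bool {
	return errors.Is(err, errMetaOutdated)
}

// annotate adds context while keeping the wrapped error reachable through %w.
// It is a stand-in for errors.Annotatef from github.com/pingcap/errors.
func annotate(err error, msg string) error {
	return fmt.Errorf("%s: %w", msg, err)
}

// isRetryable is a deliberately tiny classifier: the sentinel short-circuits
// to false, and (as a simplifying assumption) any other non-nil error is retryable.
func isRetryable(err error) bool {
	if err == nil || isMetaOutdated(err) {
		return false
	}
	return true
}

func main() {
	// Executor setup cannot find the index, so it annotates the sentinel.
	setupErr := annotate(errMetaOutdated, "index info not found")
	// A caller adds one more layer of context on the way up.
	wrapped := annotate(setupErr, "init subtask")

	fmt.Println("stale meta retryable:", isRetryable(wrapped))
	fmt.Println("transient retryable:", isRetryable(annotate(errTransient, "read range")))
}
```

Because classification goes through `errors.Is`, the sentinel stays detectable no matter how many layers of context are added, which is the property the new test checks for the wrapped case.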

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

approved, lgtm

Suggested reviewers

  • joechenrh
  • wjhuang2016

Poem

🐰 I found a stale meta in a looping plight,
I marked it with a sentinel, gentle and bright.
When workers resume and hit that old key,
They fail fast and quiet — finally set free.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title concisely describes the main change: handling stale backfill task metadata as non-retryable failures in DDL distributed backfill operations. |
| Description check | ✅ Passed | The PR description follows the template with Issue Number, Problem Summary, What Changed, Check List with tests selected, and a detailed Release Note addressing the bug fix. |
| Linked Issues check | ✅ Passed | The PR changes fully address issue #68828 by introducing a stale backfill task metadata error, validating persisted metadata before resuming tasks, treating stale meta as non-retryable, and adding tests verifying the error is non-retryable. |
| Out of Scope Changes check | ✅ Passed | All changes in the PR are directly scoped to addressing stale DXF backfill task metadata handling as specified in issue #68828; no out-of-scope changes detected. |


@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
pkg/ddl/index.go (1)

3107-3115: ⚡ Quick win

Consider adding a log message when stale task metadata is detected.

The code correctly validates and fails stale tasks, but a dedicated log message at this point would improve observability when debugging production incidents.

📋 Suggested log addition
 if err := validateBackfillTaskMeta(task, reorgInfo); err != nil {
+	logutil.DDLLogger().Warn("resuming task with stale metadata, marking as failed",
+		zap.Int64("taskID", task.ID), zap.String("taskKey", task.Key),
+		zap.Int64("jobID", reorgInfo.Job.ID), zap.Error(err))
 	if !task.TaskBase.IsDone() {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/ddl/index.go` around lines 3107 - 3115, When
validateBackfillTaskMeta(task, reorgInfo) returns an error and you proceed to
fail the task via taskManager.FailTask(w.workCtx, task.ID, task.State, err), add
a structured log entry before returning that records the stale metadata error
and task identifiers; specifically log the error value, task.ID, task.State and
whether task.TaskBase.IsDone() so operators can trace why
validateBackfillTaskMeta failed—place the log just after the
validateBackfillTaskMeta check and before calling
taskManager.FailTask/handle.NotifyTaskChange.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 4ecb5d64-f8e6-4d7f-8071-045bc8b7e986

📥 Commits

Reviewing files that changed from the base of the PR and between 147980c and a84e38c.

📒 Files selected for processing (3)
  • pkg/ddl/backfilling_dist_executor.go
  • pkg/ddl/backfilling_test.go
  • pkg/ddl/index.go

@codecov

codecov Bot commented Jun 1, 2026

Codecov Report

❌ Patch coverage is 20.00000% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.3860%. Comparing base (99e1c67) to head (0ec810e).
⚠️ Report is 32 commits behind head on master.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #68842        +/-   ##
================================================
+ Coverage   76.3104%   76.3860%   +0.0755%     
================================================
  Files          2041       2054        +13     
  Lines        563452     583211     +19759     
================================================
+ Hits         429973     445492     +15519     
- Misses       132563     134982      +2419     
- Partials        916       2737      +1821     
| Flag | Coverage Δ | |
|---|---|---|
| integration | 46.1456% <20.0000%> (+6.3671%) | ⬆️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

| Components | Coverage Δ | |
|---|---|---|
| dumpling | 60.4610% <ø> (ø) | |
| parser | ∅ <ø> (∅) | |
| br | 65.1622% <ø> (+2.3311%) | ⬆️ |

Comment thread pkg/ddl/backfilling_dist_executor.go Outdated
Comment thread pkg/ddl/index.go Outdated
if err := json.Unmarshal(task.Meta, taskMeta); err != nil {
	return errors.Trace(err)
}
if err := validateBackfillTaskMeta(task, reorgInfo); err != nil {
Contributor

Why do we need to validate and fail the task here? If a DXF subtask fails, then since we have made the error non-retryable, the whole task fails; the DDL part will then notice that it's reverted and fail the DDL job.

Co-authored-by: D3Hunter <jujj603@gmail.com>

@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/ddl/backfilling_dist_executor.go (1)

20-20: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Remove unused strings import to fix build break.

Line 20 imports strings, but it is no longer referenced. Go will fail compilation with an unused import error.

Proposed fix
 import (
 	"context"
 	"encoding/json"
-	"strings"
 
 	"github.com/pingcap/errors"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/ddl/backfilling_dist_executor.go` at line 20, Remove the unused "strings"
import from the import block in backfilling_dist_executor.go to fix the Go build
error; locate the import list in that file (the line currently reading
"strings") and delete that entry so only used packages remain imported.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 08471b1e-fb75-4e57-bbc6-47932041ff9d

📥 Commits

Reviewing files that changed from the base of the PR and between a84e38c and 711597d.

📒 Files selected for processing (1)
  • pkg/ddl/backfilling_dist_executor.go

@ti-chi-bot ti-chi-bot Bot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 5, 2026

Labels

do-not-merge/needs-triage-completed release-note Denotes a PR that will be considered when it comes time to generate release notes. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ddl: stale DXF backfill task meta can make add index retry forever

2 participants